TopoStats: An open-software and FAIR4RS success

Who am I?

  • Sylvia Whittle
  • Research Technician / Software Engineer
  • Background in Physics
  • Somehow now in a biolab?
  • Part of the Pyne Lab research group, based in the Royce Discovery Centre in Sheffield
  • I like advocating for open software and research
  • (Hopefully) starting PhD in October

What is TopoStats?

[Figure panels: Raw data, Flattened, Grains detected]

  • Python toolkit for automated processing of atomic force microscopy (AFM) data.
  • Free and open-source research software
  • Developed by a small team at the University of Sheffield
  • Takes raw noisy, non-flat images
  • Flattens them
  • Detects structures in the data
  • Calculates statistics for the structures
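
The three steps above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (row-median flattening, a fixed height threshold, 4-connected flood fill); it is not TopoStats' actual algorithm or API, and the function names `flatten`, `detect_grains` and `grain_statistics` are hypothetical.

```python
from statistics import median


def flatten(image):
    """Remove each row's offset by subtracting that row's median height."""
    flattened = []
    for row in image:
        offset = median(row)
        flattened.append([value - offset for value in row])
    return flattened


def detect_grains(image, threshold):
    """Label 4-connected regions whose height exceeds `threshold`."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    n_grains = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] <= threshold or labels[r][c]:
                continue
            # New unlabelled pixel above threshold: flood-fill its region.
            n_grains += 1
            labels[r][c] = n_grains
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] > threshold
                            and not labels[ny][nx]):
                        labels[ny][nx] = n_grains
                        stack.append((ny, nx))
    return labels, n_grains


def grain_statistics(image, labels, n_grains):
    """Return the pixel area and maximum height of each labelled grain."""
    stats = []
    for grain in range(1, n_grains + 1):
        heights = [
            image[r][c]
            for r, label_row in enumerate(labels)
            for c, label in enumerate(label_row)
            if label == grain
        ]
        stats.append({"grain": grain, "area_px": len(heights), "max_height": max(heights)})
    return stats


# Toy "scan": every row has a different offset, plus two raised features.
raw = [
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [2.0, 7.0, 7.0, 2.0, 2.0, 2.0],
    [3.0, 8.0, 3.0, 3.0, 3.0, 3.0],
    [4.0, 4.0, 4.0, 4.0, 9.0, 4.0],
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
]
flat = flatten(raw)
labels, n_grains = detect_grains(flat, threshold=2.0)
stats = grain_statistics(flat, labels, n_grains)
```

Real AFM images need more robust flattening and thresholding than this, but the shape of the pipeline is the same: flatten, label, measure.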

TopoStats: A year ago

  • Very hard to install
  • Outdated dependencies
  • Contributing was a mess
  • Hard-coded values
  • No versioning, no releases
  • No review process
  • No tests
  • Buggy!

This made working on TopoStats confusing, difficult and error-prone, and data and scripts were easily lost

Hardly anyone used or knew about it

Dependencies

Gwyddion

  • AFM analysis software
  • Written in C
  • Almost no code comments
  • Slow to use (GUI only)
  • No automation
  • No standardisation
  • Near impossible to contribute to / edit

PyGwy

  • Python binding to Gwyddion’s methods and functions.
  • Written in outdated Python 2.7
  • Will not be updated
  • Lacking in documentation
  • Difficult to contribute to

March 2022: Installation procedure

  • Uninstall all Python, Gwyddion, PyGObject, PyCairo and PyGTK installations
  • Delete all caches of the above software
  • Install 32-bit Anaconda
  • Install Python 2.7
  • Install PyCharm (register for an account if necessary)
  • Install Gwyddion (From an unfamiliar website)
  • Download a set of additional files from Google Drive, hosted by our lab
  • Set up the environment
    • Import the environment from the gwyconda.yml file.
    • Follow some images to determine which checkboxes to select.
    • Locate your Python environment
    • Install the PyGTK2 packages:
      • Install PyGTK
      • Install PyCairo
      • Install PyGObject
      • Manually add the paths for these into Anaconda
    • Change the Gwyconda environment directory to the bin folder in Gwyddion
  • Set up PyCharm
    • Open a new project and set the interpreter to Gwyconda
    • Create a Python file
    • Append the path of the bin folder in Gwyddion
    • Ignore all runtime warnings
  • (Go back to the start because something went wrong in the installation)

FAIR(4RS) Principles

  • Findable
    • Easy to find, with a unique identifier for each version ❌
    • Metadata (summary info, e.g. License, Website, coding language) ❌
  • Accessible
    • Retrievable using a free and open protocol ✅
    • Metadata are accessible, even when the software is no longer available ❌
  • Interoperable
    • Software uses data in a way that meets community standards ✅
    • Software includes references to other objects ❌
  • Reusable
    • Metadata (how to use) and License ❌
    • Detailed provenance (information on its context, maintainers and dependencies) ❌

Introducing ✨ TopoStats 2.0 ✨

Removing hard-coded variables

  • Added config file
  • Easy editing of parameters
  • No more script editing for users
  • Configurations are saved with the outputs for reproducibility
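
A minimal sketch of the idea: defaults live in one place, a user file overrides them, and the configuration actually used is written next to the results. JSON is used here only to stay within the standard library; the file format, the parameter names (`threshold`, `min_area_px`) and the helper names `load_config` and `save_run` are assumptions for illustration, not TopoStats' real config schema.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical default parameters that used to be hard-coded in scripts.
DEFAULT_CONFIG = {"threshold": 2.0, "min_area_px": 2}


def load_config(path=None):
    """Return the defaults, overridden by any values in the user's file."""
    config = dict(DEFAULT_CONFIG)
    if path is not None:
        config.update(json.loads(Path(path).read_text()))
    return config


def save_run(output_dir, results, config):
    """Write the results and the exact configuration used, side by side."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / "results.json").write_text(json.dumps(results, indent=2))
    (output_dir / "config_used.json").write_text(json.dumps(config, indent=2))


# Example run in a temporary directory: the user changes only one parameter.
with tempfile.TemporaryDirectory() as tmp:
    user_config = Path(tmp) / "my_config.json"
    user_config.write_text(json.dumps({"threshold": 3.5}))

    config = load_config(user_config)
    save_run(Path(tmp) / "output", {"n_grains": 2}, config)
    saved = json.loads((Path(tmp) / "output" / "config_used.json").read_text())
```

Anyone who later receives the output folder can rerun the analysis with the same `config_used.json` rather than guessing which values were set.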

Documentation

  • Documentation now written alongside code as docstrings.
  • Every function, class and file has one.
  • Docstrings describe the parameters and return values
  • They make the code easier to read
  • The documentation is hosted automatically on ReadTheDocs using the Sphinx documentation generator.
  • Easily searchable online
  • Releases automatically submitted to ORDA
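
As an illustration of what such a docstring can look like, here is a hypothetical function documented in the NumPy docstring style. The function `threshold_mask` and the choice of docstring style are assumptions for this sketch, not taken from the TopoStats codebase.

```python
def threshold_mask(image, threshold):
    """Mark which pixels lie above a height threshold.

    Parameters
    ----------
    image : list of list of float
        Two-dimensional height map, one inner list per scan row.
    threshold : float
        Height above which a pixel counts as part of a structure.

    Returns
    -------
    list of list of bool
        Mask with the same shape as ``image``; ``True`` where the height
        exceeds ``threshold``.
    """
    return [[value > threshold for value in row] for row in image]
```

The same text is available interactively through `help(threshold_mask)`, and a documentation generator such as Sphinx can pull it into the hosted pages, so the explanation is written once and stays next to the code it describes.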

Versioning

  • Stable versions are released incrementally from GitHub to PyPI, bundling features together

  • Allows users to use stable, fully documented versions while we work on developing new features
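
Once releases are versioned, an analysis can record exactly which release produced it. A small sketch using only the standard library; the helper name `installed_version` is hypothetical, and it returns `None` when the package is not installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed release number, or None if TopoStats is not installed.
print(installed_version("topostats"))
```

Storing this string alongside the outputs makes it possible to reinstall the same release later.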

Updated GitHub Page

GitHub Issues and Milestones

  • GitHub Issues
    • For planned features
  • GitHub Milestones
    • For planned releases

Linting and formatting

Making code universally understandable

  • Pylint, Black, Flake8
  • Pre-commit

Testing, Testing, Testing

  • Ensures code does what it is intended to do
  • Flags if code unexpectedly breaks

  • Uses pytest and GitHub Actions
  • Catches errors and prevents bad code from being added
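
A sketch of what such tests look like. The function `grain_area_nm2` is a made-up example rather than part of TopoStats; pytest collects functions named `test_*` from files named `test_*.py` and reports any failing `assert`.

```python
def grain_area_nm2(n_pixels, pixel_size_nm):
    """Convert a pixel count to an area in square nanometres."""
    if n_pixels < 0 or pixel_size_nm <= 0:
        raise ValueError("n_pixels must be >= 0 and pixel_size_nm must be > 0")
    return n_pixels * pixel_size_nm ** 2


def test_grain_area_scales_with_pixel_size():
    # 10 pixels of 2 nm x 2 nm each cover 40 nm^2.
    assert grain_area_nm2(10, 2.0) == 40.0


def test_grain_area_rejects_negative_pixel_count():
    # Invalid input should fail loudly rather than return a nonsense area.
    try:
        grain_area_nm2(-1, 2.0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for a negative pixel count")
```

With pytest installed, the second test would more commonly use `pytest.raises(ValueError)`; the plain `try`/`except` form is used here so the snippet runs without extra dependencies. Running the same tests on every pull request is what lets the automated checks block a change that breaks existing behaviour.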

Code review

  • Contribution using GitHub’s Pull Requests
  • A pull request requires approval from at least one other person on the project before being merged (accepted).
  • This prevents bad code from being added to the project

TopoStats 2.0 installation procedure

  • conda create -n topostats python=3.10
    conda activate topostats
    pip install topostats
  • run_topostats

One year on

  • Findable
    • Easy to find, with a unique identifier for each version ✅
    • Metadata (summary info, e.g. License, Website, coding language) ✅
  • Accessible
    • Retrievable using a free and open protocol ✅
    • Metadata are accessible, even when the software is no longer available ✅
  • Interoperable
    • Software uses data in a way that meets community standards ✅
    • Software includes references to other objects ✅
  • Reusable
    • Metadata (how to use) and License ✅
    • Detailed provenance (information on its context, maintainers and dependencies) ✅

Remaining challenges

  • Documentation versioning
  • Jupyter notebooks for improved accessibility
  • GUI?
  • Balancing openness and NDA information from external research partners.
  • Balancing internal team meetings with openness on GitHub.

Benefits and drawbacks of Open Research / Software

Benefits

  • Collaboration
    • Sharing data, methods, code
    • Faster, more efficient progress (but not always!)
  • Inclusivity
    • Data and software are free to access
  • Greater impact
    • More accessible to a wider audience
  • Transparency and constructive criticism
    • Enables and even encourages people to critique methods that are used.
  • Reproducibility
    • Others can more easily replicate research

Drawbacks

  • Chaos in open software development
    • As the community grows, so do the feature requests and issues!
    • Competing demands of users, collaborators, work
    • Sporadic nature of working with people with varying availability
    • Relying on open-source software can be risky: there are innumerable abandoned packages
  • Navigating work under NDA has been a lot harder
    • Hard to work on features without example data
    • Cannot document certain things

Personal experience with Open Research / Software

Personal benefits

  • Communities
    • TopoStats and its related research being open has built a wonderful community that spreads globally and is growing at an increasing rate.
    • For example, we have met people from the UK, France, Spain, Germany and America.
  • Professional development
    • Open research / development is great for networking and building reputation
    • Easier to advertise oneself as opposed to working on closed-source software and non-open research.
  • Social benefit of open research and software.
    • Working on open research / software is viewed positively

Personal drawbacks

  • Stress!
  • More scrutiny
  • People depend on you
  • More time demands
  • The systems can be cumbersome

Acknowledgements

I would like to thank everyone who has worked on and helped with TopoStats. This was a massive group effort. (In alphabetical order)

  • Alice Pyne
  • Billie Ward
  • Bob Turner
  • Eddie Rollins
  • Jean Du
  • Joe Beton
  • Libby Holmes
  • Max Gamill
  • Neil Shephard
  • Rob Moorehead
  • Tobi Firth
  • Tom Catley